43 research outputs found
Semi-supervised model-based clustering with controlled clusters leakage
In this paper, we focus on finding clusters in partially categorized data
sets. We propose a semi-supervised version of the Gaussian mixture model,
called C3L, which retrieves natural subgroups of given categories. In contrast
to other semi-supervised models, C3L is parametrized by a user-defined leakage
level, which controls the maximal inconsistency between the initial
categorization and the resulting clustering. Our method can be implemented as
a module in practical expert systems to detect clusters that combine expert
knowledge with the true distribution of the data. Moreover, it can be used to
improve the results of less flexible clustering techniques, such as projection
pursuit clustering. The paper presents an extensive theoretical analysis of
the model and a fast algorithm for its efficient optimization. Experimental
results show that C3L finds a high-quality clustering model, which can be
applied to discovering meaningful groups in partially classified data.
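The general idea of seeding a Gaussian mixture with partial labels can be sketched as follows. This is an illustrative numpy sketch only, not the C3L algorithm itself: it hard-assigns labelled points to their categories (zero leakage) and omits the user-defined leakage level described above; all names are hypothetical.

```python
import numpy as np

def seeded_gmm(X, labels, n_iter=50, eps=1e-6):
    """EM for an isotropic Gaussian mixture whose components are seeded
    from partially labelled points (label -1 = unlabelled)."""
    classes = np.unique(labels[labels >= 0])
    k = len(classes)
    # Seed each component's mean with the labelled points of one category.
    mu = np.stack([X[labels == c].mean(axis=0) for c in classes])
    var = np.full(k, X.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities under isotropic Gaussians.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        log_r = np.log(pi) - 0.5 * d2 / var - 0.5 * X.shape[1] * np.log(var)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # Clamp labelled points to their own category (no leakage allowed).
        for j, c in enumerate(classes):
            r[labels == c] = 0.0
            r[labels == c, j] = 1.0
        # M-step: update weights, means and per-component variances.
        nk = r.sum(axis=0) + eps
        pi = nk / len(X)
        mu = (r.T @ X) / nk[:, None]
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        var = (r * d2).sum(axis=0) / (nk * X.shape[1]) + eps
    return mu, r.argmax(axis=1)
```

C3L relaxes the hard clamp in the E-step: its leakage level bounds how far the final clustering may drift from the initial categorization instead of forbidding any drift.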
Set Aggregation Network as a Trainable Pooling Layer
Global pooling, such as max- or sum-pooling, is one of the key ingredients in
deep neural networks used for processing images, texts, graphs and other types
of structured data. Based on the recent DeepSets architecture proposed by
Zaheer et al. (NIPS 2017), we introduce a Set Aggregation Network (SAN) as an
alternative global pooling layer. In contrast to typical pooling operators,
SAN allows us to embed a given set of features into a vector representation of
arbitrary size. We show that by adjusting the size of the embedding, SAN is
capable of preserving all the information from the input. In experiments, we
demonstrate that replacing the global pooling layer with SAN leads to an
improvement in classification accuracy. Moreover, it is less prone to
overfitting and can be used as a regularizer.
Comment: ICONIP 201
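The order-invariant set-to-vector mapping underlying this family of pooling layers can be sketched in a few lines of numpy. This follows the DeepSets pattern (per-element transform, permutation-invariant sum, post-aggregation transform) rather than SAN's exact parametrization, which is given in the paper; weight names are illustrative.

```python
import numpy as np

def set_aggregate(X_set, W_phi, b_phi, W_rho, b_rho):
    """Map a variable-size set of feature vectors (n, d_in) to a single
    embedding of size d_out, independently of n and of element order."""
    h = np.maximum(X_set @ W_phi + b_phi, 0.0)  # per-element transform (ReLU)
    pooled = h.sum(axis=0)                      # permutation-invariant sum
    return np.tanh(pooled @ W_rho + b_rho)      # post-aggregation transform
```

Unlike max- or sum-pooling, whose output size is fixed by the input feature dimension, the output size here is set by the shape of W_rho, which is what lets the embedding be made large enough to preserve the input information.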
Estimating conditional density of missing values using deep Gaussian mixture model
We consider the problem of estimating the conditional probability
distribution of missing values given the observed ones. We propose an approach
that combines the flexibility of deep neural networks with the simplicity of
Gaussian mixture models (GMMs). Given an incomplete data point, our neural
network returns the parameters of a Gaussian distribution (in the form of a
Factor Analyzers model) representing the corresponding conditional density. We
experimentally verify that our model provides a better log-likelihood than a
conditional GMM trained in a typical way. Moreover, the imputation obtained by
replacing missing values with the mean vector of our model looks visually
plausible.
Comment: A preliminary version of this paper appeared as an extended abstract
at the ICML 2020 Workshop on The Art of Learning with Missing Value
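The Gaussian conditioning step underlying such models can be worked through directly: for a joint Gaussian, the distribution of the missing block given the observed one is again Gaussian, with the mean serving as the imputation. A minimal numpy sketch (function name is illustrative):

```python
import numpy as np

def conditional_gaussian(mu, Sigma, obs_idx, mis_idx, x_obs):
    """Return mean and covariance of x[mis_idx] | x[obs_idx] = x_obs
    for a joint Gaussian N(mu, Sigma)."""
    mu_o, mu_m = mu[obs_idx], mu[mis_idx]
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)]
    S_mo = Sigma[np.ix_(mis_idx, obs_idx)]
    S_mm = Sigma[np.ix_(mis_idx, mis_idx)]
    K = S_mo @ np.linalg.inv(S_oo)        # regression coefficients
    mu_cond = mu_m + K @ (x_obs - mu_o)   # conditional mean (the imputation)
    Sigma_cond = S_mm - K @ S_mo.T        # conditional covariance
    return mu_cond, Sigma_cond
```

For example, with mu = (0, 0), unit variances and correlation 0.5, observing x0 = 2 gives a conditional mean of 1.0 and a conditional variance of 0.75 for x1. In the paper's setting the neural network predicts the Gaussian parameters (in Factor Analyzer form) per incomplete input rather than using one fixed joint Gaussian.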
Pointed subspace approach to incomplete data
Incomplete data are often represented as vectors with the missing attributes filled in, joined with flag vectors indicating the missing components. In this paper, we generalize this approach and represent incomplete data as pointed affine subspaces. This allows us to perform various affine transformations of the data, such as whitening or dimensionality reduction. Moreover, this representation preserves the information about which coordinates were missing. To use our representation in practical classification tasks, we embed such generalized missing data into a vector space and define the scalar product of the embedding space. Our representation is easy to implement and can be used together with typical kernel methods. The performed experiments show that applying an SVM classifier to the proposed subspace approach yields highly accurate results.
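The baseline representation that the abstract generalizes (fill missing attributes, append a binary flag vector) can be sketched as follows; here the fill value is the per-column mean, which is one common choice, and the function name is illustrative:

```python
import numpy as np

def fill_and_flag(X):
    """X: (n, d) array with np.nan marking missing entries.
    Returns an (n, 2d) array: mean-imputed values concatenated
    with binary flags indicating which coordinates were missing."""
    mask = np.isnan(X)
    col_means = np.nanmean(X, axis=0)       # per-column mean of observed values
    filled = np.where(mask, col_means, X)   # replace NaNs with column means
    return np.hstack([filled, mask.astype(float)])
```

The pointed-subspace representation replaces the single filled vector with the whole affine subspace of points consistent with the observed coordinates, while still retaining the missingness information that the flag vector carries here.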
Subspace memory clustering
We present a new subspace clustering method called SuMC (Subspace Memory Clustering), which allows us to efficiently divide a dataset D ⊂ R^N into k ∈ N pairwise disjoint clusters of possibly different dimensions. Since our approach is based on memory compression, we do not need to explicitly specify the dimensions of the groups: in fact, we only need to specify the mean number of scalars used to describe a data point. In the case of one cluster, our method reduces to the classical Karhunen-Loève (PCA) transform. We test our method on typical data from the UCI repository and on data coming from real-life experiments.
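The single-cluster case mentioned above can be made concrete: describing each point by its first d principal coordinates and measuring the squared reconstruction error gives the compression cost that PCA minimizes. A minimal numpy sketch of this one-cluster view (not the full SuMC algorithm, which also allocates scalars across k clusters):

```python
import numpy as np

def pca_compression_error(X, d):
    """Mean squared error of reconstructing X from its top-d
    principal coordinates (the per-point cost of describing each
    point with d scalars in the PCA basis)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # Right singular vectors of the centered data are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ Vt[:d].T @ Vt[:d]   # project onto the top-d subspace
    return float(((Xc - proj) ** 2).mean())
```

SuMC's memory-compression criterion balances such per-cluster reconstruction costs against the number of scalars spent per point, which is how cluster dimensions emerge without being specified explicitly.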